Abstract

We analyze the Dawid-Rissanen prequential maximum likelihood codes relative to one-parameter
exponential family models $\mathcal{M}$. If data are i.i.d. according to an (essentially)
arbitrary $P$, then the redundancy grows at rate $\frac{1}{2} c \ln n$. We show that
$c = \sigma_1^2 / \sigma_2^2$, where $\sigma_1^2$ is the variance of $P$, and $\sigma_2^2$
is the variance of the distribution $M^* \in \mathcal{M}$ that is closest to $P$ in KL
divergence. This shows that prequential codes behave quite differently from other
important universal codes such as the 2-part MDL, Shtarkov and Bayes codes, for which
$c = 1$. This behavior is undesirable in an MDL model selection setting.

1 Introduction 

Universal coding lies at the basis of on-line prediction algorithms for data compression and
gambling purposes. It has been extensively studied in the COLT community, typically under
the name of 'sequential prediction with log loss'; see, for example, [?]. It also underlies
Rissanen's theory of MDL (minimum description length) learning [?] and Dawid's theory of
prequential model assessment [?]. Roughly, a code is universal with respect to a set of candidate
codes $\mathcal{M}$ if it achieves small redundancy: it allows one to encode data using not many
more bits than the optimal code in $\mathcal{M}$. The redundancy is very closely related to the
expected regret, which is perhaps more widely known within the COLT community; we compare the
two notions in Section 4. The main types of universal codes are the Shtarkov or NML code, the
Bayesian mixture code, the 2-part MDL code and the prequential maximum likelihood (ML) code, also
known as the 'ML plug-in code' or the 'predictive MDL code' [?]. This code was introduced
independently by Rissanen [20] in the context of MDL learning and by Dawid [?], who proposed
it as a probability forecasting strategy rather than directly as a code. The underlying ideas are
explained in Section 2. Here we study the case where no code in $\mathcal{M}$ corresponds to the
data generating distribution $P$. We find that in this case, the redundancy of the prequential
code can be quite different from that of the other three methods. Specifically, if $\mathcal{M}$
is a one-dimensional exponential family, then the redundancies are $\frac{1}{2} c \ln n + O(1)$.
Whereas it is known that for the Bayes, NML and 2-part codes, under regularity conditions on $P$
and $\mathcal{M}$, we have $c = 1$ (Section 4), we determine $c$ for the prequential code and find
that, depending on properties of $P$ and $\mathcal{M}$, it can be either larger or smaller than 1.

Relevance Our result has at least three important consequences, which are discussed further
in Section 6.

1. Practical consequence for data compression When prequential codes are used for data
compression in the realistic situation that $P \notin \mathcal{M}$, then depending on the
situation they can behave either better or worse than the Bayesian and NML codes (end of
Section 2).

2. Practical consequence for MDL learning/model selection In the case of model selection
between two nonoverlapping parametric models, our results suggest (but do not
prove) that the prequential plug-in codes typically behave worse (and never better) than
the Bayesian or NML code. We have experimental evidence for this for the Poisson and
geometric families.

3. Theoretical Our result implies that, under misspecification, the Kullback-Leibler (KL) risk
of efficient estimators behaves in a fundamentally different way from the KL risk of
estimators such as the Bayes predictive distribution, which are not restricted to lie in the
model $\mathcal{M}$ under consideration.

Contents The remainder of the paper is organized as follows. In Section 2 we informally state
and explain our result, and we discuss how it relates to previous results. Section 3 contains the
formal statement of our main result (Theorem 1) as well as a brief proof sketch. In Section 4 we
show that a version of our result still holds if 'redundancy' is replaced by 'expected regret'.
We discuss further issues regarding our result in Section 5. We explain the relevance of our
result, including the consequences listed above, in Section 6. Section 7 proves our main result.
The proof makes use of several lemmas which are stated and proven in Section 8. The second
result, discussed in Section 4, is proven in Section 9. The paper ends with a conclusion.

2 Main Result, Informally 

Suppose $\mathcal{M} = \{M_\theta : \theta \in \Theta\}$ is a $k$-dimensional parametric family
of distributions, and $Z_1, Z_2, \ldots$ are i.i.d. according to some distribution
$P \in \mathcal{M}$. The redundancy of a universal code $U$ with respect to $P$ is defined as

$$\mathcal{R}_U(n) := E_P[L_U(Z_1, \ldots, Z_n)] - \inf_{\theta \in \Theta} E_P[-\ln M_\theta(Z_1, \ldots, Z_n)], \qquad (1)$$

where $L_U$ is the length function of $U$ and $M_\theta(Z_1, \ldots, Z_n)$ denotes the probability
mass or density of $Z_1, \ldots, Z_n$ under distribution $M_\theta$; these and other notational
conventions are detailed in Section 3. By the information inequality [?] the second term is
minimized for $M_\theta = P$, so that

$$\mathcal{R}_U(n) = E_P[L_U(Z_1, \ldots, Z_n)] - E_P[-\ln P(Z_1, \ldots, Z_n)]. \qquad (2)$$

Thus, (2) can be interpreted as the expected number of additional nats one needs to encode
$n$ outcomes if one uses the code $U$ instead of the optimal (Shannon-Fano) code with lengths
$-\ln P(Z_1, \ldots, Z_n)$. A good universal code achieves small redundancy for all or 'most'
$P \in \mathcal{M}$ (the relation to the concept of 'regret' is discussed in Section 4).

The four major types of universal codes, Bayes, NML, 2-part and prequential ML, all achieve
redundancies that are (in an appropriate sense) close to optimal. Specifically, under regularity
conditions on $\mathcal{M}$ and its parameterization, these four types of universal codes all
satisfy, for all $P \in \mathcal{M}$,

$$\mathcal{R}_U(n) = \frac{k}{2} \ln n + O(1), \qquad (3)$$

where the $O(1)$ may depend on $\theta$ and the universal code used. (3) is the famous '$k$ over 2
log $n$ formula', refinements of which lie at the basis of most practical approximations to MDL
learning; see [?].

In this paper we consider the case where the data are i.i.d. according to an arbitrary $P$
not necessarily in the model $\mathcal{M}$. To emphasize that the redundancy is measured relative
to the element of the model that minimizes the codelength rather than to $P$, we use the term
relative redundancy rather than just redundancy. Its definition remains unchanged, but it can no
longer be rewritten as (2). Assuming it exists and is unique, let $M_{\theta^*}$ be the element of
$\mathcal{M}$ that minimizes KL divergence to $P$:

$$\theta^* := \arg\min_{\theta \in \Theta} D(P \| M_\theta) = \arg\min_{\theta \in \Theta} E_P[-\ln M_\theta(Z)],$$

where the equality follows from the definition of the KL divergence [?]. Then the relative
redundancy satisfies

$$\mathcal{R}_U(n) = E_P[L_U(Z_1, \ldots, Z_n)] - E_P[-\ln M_{\theta^*}(Z_1, \ldots, Z_n)]. \qquad (4)$$

It turns out that for the NML, 2-part MDL and Bayes codes, the relative redundancy (4) with
$P \notin \mathcal{M}$ still satisfies (3), at least under conditions on $\mathcal{M}$ and $P$;
see Section 4. In this paper, we show for the first time that (3) does not hold for the
prequential ML code. The prequential ML code $U$ works by sequentially predicting $Z_{i+1}$ using
a (slightly modified) ML or Bayesian MAP estimator $\hat\theta_i = \hat\theta(z^i)$ based on the
past data, that is, the first $i$ outcomes $z^i = z_1, \ldots, z_i$. The total codelength
$L_U(z^n)$ on a sequence $z^n$ is given by the sum of the individual 'predictive' codelengths
(log losses): $L_U(z^n) = \sum_{i=0}^{n-1} [-\ln M_{\hat\theta_i}(z_{i+1})]$. In our main theorem,
we show that if $L_U$ denotes the prequential ML code length, and $\mathcal{M}$ is a regular
one-parameter exponential family ($k = 1$), then

$$\mathcal{R}_U(n) = \frac{1}{2} \frac{\mathrm{var}_P X}{\mathrm{var}_{M_{\theta^*}} X} \ln n + O(1), \qquad (5)$$

where $X$ is the sufficient statistic of the family. In Example 1 below we give an example of
the phenomenon. The result holds as long as $\mathcal{M}$ and $P$ satisfy Condition 1 defined
below. Essentially, as long as the fourth moment of $P$ exists, the condition holds for all
exponential families we checked, including the Poisson, geometric, exponential, normal with fixed
mean or variance and Pareto distributions. The result indicates that the redundancy can be both
larger and smaller than $\frac{1}{2} \ln n$, depending on the variance of the 'true' $P$. We can
only guarantee that the two variances are the same if $P \in \mathcal{M}$, in which case
$M_{\theta^*} = P$. It immediately follows that in practical data compression tasks, whenever
$P \notin \mathcal{M}$, the redundancy of the prequential ML code can be both smaller and larger
than that of the Bayesian code, depending on the situation. This is the first of the three
implications of our result, listed in Section 1. We postpone discussion of the other two
implications to Section 6.
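
As a rough numerical illustration of (5), the leading coefficient of $\ln n$ can be computed
directly from the two variances. The sketch below is not from the paper: the function names are
hypothetical, and it assumes a Poisson model (variance equal to the mean) and a geometric model on
{0, 1, 2, ...} (variance $\mu(1+\mu)$ for mean $\mu$), with the closest model element sharing the
mean of $P$.

```python
def redundancy_coefficient(var_p, var_model):
    """Leading coefficient of ln n in (5): (1/2) * var_P(X) / var_{M*}(X)."""
    if var_model <= 0:
        raise ValueError("model variance must be positive")
    return 0.5 * var_p / var_model


def poisson_variance(mean):
    # A Poisson distribution has variance equal to its mean.
    return mean


def geometric_variance(mean):
    # A geometric distribution on {0, 1, 2, ...} with mean m has variance m * (1 + m).
    return mean * (1.0 + mean)


mu = 4.0
# Poisson model, geometric data with the same mean: coefficient above 1/2.
print(redundancy_coefficient(geometric_variance(mu), poisson_variance(mu)))
# Geometric model, Poisson data with the same mean: coefficient below 1/2.
print(redundancy_coefficient(poisson_variance(mu), geometric_variance(mu)))
# Well-specified case (P in the model): the usual 1/2.
print(redundancy_coefficient(poisson_variance(mu), poisson_variance(mu)))
```

Under these assumptions the same pair of families gives a coefficient above 1/2 in one direction
and below 1/2 in the other, which is the asymmetry the text describes.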

Example 1 Let $\mathcal{M}$ be the family of Poisson distributions, parameterized by their mean
$\mu$. Then the ML estimator $\hat\mu_i$ is the empirical mean of $Z_1, \ldots, Z_i$. Suppose
$Z, Z_1, Z_2, \ldots$ are i.i.d. according to a degenerate $P$ with $P(Z = 4) = 1$. Since the
sample average is a sufficient statistic for the Poisson family, $\hat\mu_i$ will be equal to 4
for all $i \geq 1$. On the other hand, $\mu^*$, the parameter (mean) of the distribution in
$\mathcal{M}$ closest to $P$ in KL-divergence, will be equal to 4 as well. Thus the redundancy (4)
of the prequential ML code is given by

$$\mathcal{R}_U(n) = \sum_{i=0}^{n-1} [-\ln M_{\hat\mu_i}(4) + \ln M_4(4)]
= -\ln M_{\hat\mu_0}(4) + \ln M_4(4) + \sum_{i=1}^{n-1} [-\ln M_4(4) + \ln M_4(4)]
= -\ln M_{\hat\mu_0}(4) + \ln M_4(4) = O(1), \qquad (6)$$

assuming an appropriate definition of $\hat\mu_0$. In the case of the Poisson family, the outcome
$Z$ is equal to the sufficient statistic $X$ in (5). Since $\mathrm{var}_P Z = 0$, this example
agrees with (5).

Related Work There is a plethora of results concerning the redundancy and/or the regret
for the prequential ML code, for a large variety of models including multivariate exponential
families, ARMA processes, regression models and so on. Examples are [?]. In all
these papers it is shown that either the regret or the redundancy grows as
$\frac{k}{2} \ln n + o(\ln n)$, either in expectation or almost surely. [?] even evaluates the
remainder term explicitly. The reason that these results do not contradict ours is that all
these papers study the case where the generating distribution $P$ is in the model, in which case
automatically $\mathrm{var}_{M_{\theta^*}}(X) = \mathrm{var}_P(X)$.
In other cases [?], regret of a prequential ML-type code is evaluated on an individual sequence
basis, and it is found that the regret grows as $\frac{k}{2} \ln n + O(1)$ for all sequences whose
ML estimator remains bounded away from the boundary of the space. The reason that these results do
not contradict ours is that in all cases that have been examined (and that we know of), the model
is complete, i.e. it contains all distributions that can be defined on the sample space for 1
outcome. Then, if data are i.i.d. according to some $P$, $P$ must be in $\mathcal{M}$, and we
automatically get $\mathrm{var}_{M_{\theta^*}}(X) = \mathrm{var}_P(X)$. An example is [?], which
uses the Bernoulli model. Apparently, we are the first to study the redundancy and regret for
incomplete models under general circumstances.

3 Main Result, Formally 

In this section, we introduce our notation, we define our quantities of interest, we state our
main result and we give a short idea of the proof. The complete proof is given in Sections 7
and 8.

Notational Conventions Throughout this text we use nats rather than bits as units of
information. Outcomes are capitalized if they are to be interpreted as random variables instead
of instantiated values. A sequence of outcomes $z_1, \ldots, z_n$ is abbreviated to $z^n$. We
write $E_P$ as a shorthand for $E_{Z \sim P}$, the expectation of $Z$ under distribution $P$. When
we consider a sequence of $n$ outcomes independently distributed $\sim P$, we also use $E_P$ as a
shorthand for the expectation of $(Z_1, \ldots, Z_n)$ under the $n$-fold product distribution of
$P$. Finally, $P(Z)$ denotes the probability mass function of $P$ in case $Z$ is discrete-valued,
and it denotes the density of $P$ in case $Z$ takes its value in a continuum. When we write
'density function of $Z$', then, if $Z$ is discrete-valued, this should be read as 'probability
mass function of $Z$'. Note however that in our main result, Theorem 1 below, we do not assume
that the data generating distribution $P$ admits a density.

Exponential Families Let $\mathcal{Z}$ be a set of outcomes, taking values either in a finite or
countable set, or in a subset of $k$-dimensional Euclidean space for some $k \geq 1$. Let
$X : \mathcal{Z} \to \mathbb{R}$ be a random variable on $\mathcal{Z}$, and let
$\mathcal{X} = \{x \in \mathbb{R} : \exists z \in \mathcal{Z} : X(z) = x\}$ be the range of $X$.

Exponential family models are families of distributions on $\mathcal{Z}$ defined relative to a
random variable $X$ (called 'sufficient statistic') as defined above, and a function
$h : \mathcal{Z} \to [0, \infty)$. We let
$Z(\eta) := \int_{z \in \mathcal{Z}} e^{-\eta X(z)} h(z) \, dz$ (where the integral is to be
replaced by a sum for countable $\mathcal{Z}$), and we let
$\Theta_\eta := \{\eta \in \mathbb{R} : Z(\eta) < \infty\}$.

Definition 1 (Exponential family) The single parameter exponential family [?] with sufficient
statistic $X$ and carrier $h$ is the family of distributions with densities
$M_\eta(z) := \frac{1}{Z(\eta)} e^{-\eta X(z)} h(z)$, where $\eta \in \Theta_\eta$. $\Theta_\eta$
is called the natural parameter space. The family is called regular if $\Theta_\eta$ is an open
interval of $\mathbb{R}$.

In the remainder of this text we only consider single parameter, regular exponential families
where the mapping from $\Theta_\eta$ to the corresponding set of distributions is 1-to-1, but
these qualifications will henceforth be omitted. Examples of this wide family of models include
the Poisson, geometric and multinomial families, and the model of all Gaussian (normal)
distributions with a fixed variance, or with a fixed mean. In the first four cases, we can take
$X$ to be the identity, so that $X = Z$ and $\mathcal{X} = \mathcal{Z}$. In the case of the normal
family with fixed mean, $\sigma^2$ becomes the sufficient statistic and we have
$\mathcal{Z} = \mathbb{R}$, $\mathcal{X} = [0, \infty)$ and $X = Z^2$.

The statistic $X(z)$ is sufficient for $\eta$ [14]. This suggests reparameterizing the
distribution by the expected value of $X$, which is called the mean value parameterization. The
function $\mu(\eta) = E_{M_\eta}[X]$ maps parameters in the natural parameterization to the mean
value parameterization. It is a diffeomorphism (it is one-to-one, onto, infinitely often
differentiable and has an infinitely often differentiable inverse) [14]. Therefore the mean value
parameter space $\Theta_\mu$ is also an open interval of $\mathbb{R}$. We note that for some
models (such as Bernoulli and Poisson), the parameter space is usually given in terms of a
non-open set of mean-values (e.g., $[0, 1]$ in the Bernoulli case). In this case, to make the
model a regular exponential family, we have to restrict the set of parameters to its own interior.
Henceforth, whenever we refer to a standard statistical model such as Bernoulli or Poisson, we
assume that the parameter set has been restricted in this sense.

We are now ready to define the prequential ML model. This is a distribution on infinite
sequences $z_1, z_2, \ldots \in \mathcal{Z}^\infty$, recursively defined in terms of the
distributions of $Z_{n+1}$ conditioned on $Z^n = z^n$, for all $n = 1, 2, \ldots$, all
$z^n = (z_1, \ldots, z_n) \in \mathcal{Z}^n$. In the definition, we use the notation
$x_i := X(z_i)$.

Definition 2 (Prequential ML model) Let $\Theta_\mu$ be the mean value parameter domain of an
exponential family $\mathcal{M} = \{M_\mu \mid \mu \in \Theta_\mu\}$. Given $\mathcal{M}$ and
constants $x_0 \in \Theta_\mu$ and $n_0 > 0$, we define the prequential ML model $U$ by setting,
for all $n$, all $z^{n+1} \in \mathcal{Z}^{n+1}$:

$$U(z_{n+1} \mid z^n) = M_{\hat\mu(z^n)}(z_{n+1}),$$

where $U(z_{n+1} \mid z^n)$ is the density/mass function of $z_{n+1}$ conditional on $Z^n = z^n$,

$$\hat\mu(z^n) := \frac{x_0 \cdot n_0 + \sum_{i=1}^{n} x_i}{n + n_0},$$

and $M_{\hat\mu(z^n)}(\cdot)$ is the density of the distribution in $\mathcal{M}$ with mean
$\hat\mu(z^n)$.

We henceforth abbreviate $\hat\mu(z^n)$ to $\hat\mu_n$. We usually refer to the prequential ML
model in terms of the corresponding codelength function

$$L_U(z^n) = \sum_{i=0}^{n-1} L_U(z_{i+1} \mid z^i) = \sum_{i=0}^{n-1} -\ln M_{\hat\mu_i}(z_{i+1}).$$


To understand this definition, note that for exponential families, for any sequence of data,
the ordinary maximum likelihood parameter is given by the average $\frac{1}{n} \sum x_i$ of the
observed values of $X$ [14]. Here we define our prequential model in terms of a slightly modified
maximum likelihood estimator that introduces a 'fake initial outcome' $x_0$ with multiplicity
$n_0$ in order to avoid infinite code lengths (see the quote by Rissanen on "inherent singularity"
in Section 6) and to ensure that the prequential ML code length of the first outcome is
well-defined. In practice we can take $n_0 = 1$ but our result holds for any $n_0 > 0$. This
definition can be reconciled with settings in which the startup problem is resolved by ignoring
the first few outcomes, by setting $x_0$ to the ML estimator for the ignored outcomes and $n_0$ to
their number. It also allows our results to be generalized to a number of other point estimators
as discussed in Section 5.2.
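
Definition 2 can be turned into a few lines of code. The following is a minimal sketch for the
Poisson family only, where the sufficient statistic is the outcome itself; the function names and
the default values of `x0` and `n0` are illustrative assumptions, not part of the paper.

```python
import math


def poisson_neg_log_prob(z, mu):
    """-ln M_mu(z) in nats for a Poisson distribution with mean mu > 0."""
    return mu - z * math.log(mu) + math.lgamma(z + 1)


def prequential_ml_codelength(zs, x0=1.0, n0=1.0):
    """Sum over i of -ln M_{mu_hat_i}(z_{i+1}), where mu_hat_i is the modified
    ML estimate (x0 * n0 + z_1 + ... + z_i) / (i + n0) from Definition 2."""
    total = 0.0
    running_sum = x0 * n0  # contribution of the fake initial outcome
    for i, z in enumerate(zs):
        mu_hat = running_sum / (i + n0)
        total += poisson_neg_log_prob(z, mu_hat)
        running_sum += z
    return total


data = [4] * 50  # the degenerate source of Example 1
best_in_model = sum(poisson_neg_log_prob(z, 4.0) for z in data)
# With x0 = 4 every estimate equals 4, so the excess codelength vanishes.
print(prequential_ml_codelength(data, x0=4.0) - best_in_model)
# With x0 = 1 the estimates only approach 4, but the excess stays bounded.
print(prequential_ml_codelength(data, x0=1.0) - best_in_model)
```

The running sum makes each prediction depend only on past outcomes, which is what allows a decoder
to reproduce the same estimates without any preamble.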

With all our definitions in place we can state our main result. 
Theorem 1 (Main result) Let $X, X_1, X_2, \ldots$ be i.i.d. $\sim P$, with $E_P[X] = \mu^*$. Let
$\mathcal{M}$ be a single parameter exponential family with sufficient statistic $X$ and $\mu^*$
an element of the mean value parameter space. Finally let $U$ denote the prequential ML model with
respect to $\mathcal{M}$. If $\mathcal{M}$ and $P$ satisfy Condition 1 below, then

$$\mathcal{R}_U(n) = \frac{\mathrm{var}_P X}{\mathrm{var}_{M_{\mu^*}} X} \cdot \frac{1}{2} \ln n + O(1).$$

To reconcile this with the informal statement (5), notice that $M_{\mu^*}$ is the element of
$\mathcal{M}$ achieving the smallest expected codelength, i.e. it achieves
$\inf_{\mu \in \Theta_\mu} D(P \| M_\mu)$ [?].

Condition 1 We require that the following holds both for $T := X$ and $T := -X$:

• If $T$ is unbounded from above then there is a $k \in \{4, 6, \ldots\}$ such that the first $k$
moments of $T$ exist under $P$ and that
$\frac{d^4}{d\mu^4} D(M_{\mu^*} \| M_\mu) = O(\mu^{k-6})$.

• If $T$ is bounded from above by a constant $g$ then
$\frac{d^4}{d\mu^4} D(M_{\mu^*} \| M_\mu)$ is polynomial in $1/(g - \mu)$.

The condition implies that Theorem 1 can be applied to most single-parameter exponential
families that are relevant in practice. To illustrate, we have computed the fourth derivative of
the divergence for a number of exponential families; all parameters besides the mean are treated
as fixed values. The results are listed in Figure 1. As can be seen from the figure, for these
exponential families, our condition applies whenever the fourth moment of $P$ exists. Note in
particular that the condition requires $\mathrm{var}_P X < \infty$.

The reason why we need Condition 1 is best explained by sketching the proof of Theorem 1.

Figure 1: $\frac{d^4}{d\mu^4} D(M_{\mu^*} \| M_\mu)$ for a number of exponential families. For the
normal distribution we use mean 0, and we list a reparametrization of the density function such
that the density of the squared outcomes is given as a function of the variance, which is
confusingly but correctly called $\mu^*$ here: the random variable $X$ in Theorem 1 is really the
observed value of $z^2$ rather than $z$ itself, so that its mean is $E[X] = E[Z^2]$, which is the
variance of the normal distribution.

Brief Proof Sketch The precise proof of Theorem 1, given in Section 7, is very technical.
Here we merely describe the underlying ideas, which are relatively simple. Consider first the
case $n_0 = 0$, so that for $n \geq 1$, $\hat\mu_n$ is just the standard ML estimator. Let $z^i$
be the initial $i$ outcomes of an arbitrary sequence $z^n = z_1, z_2, \ldots, z_n$. As is
well-known, a straightforward second-order Taylor expansion of
$D(M_{\mu^*} \| M_{\hat\mu(z^i)})$ around $\mu^*$ gives

$$D(M_{\mu^*} \| M_{\hat\mu_i}) = \frac{1}{2} I(\mu^*) (\hat\mu_i - \mu^*)^2 + \text{Remainder}. \qquad (7)$$

Here $I(\mu^*)$ is the Fisher information in one observation, evaluated at $\mu^*$; see
Section [?]. For exponential families in their mean-value parameterization, another standard
result [?] says that for all $\mu$,

$$I(\mu) = \frac{1}{\mathrm{var}_{M_\mu} X}. \qquad (8)$$

Therefore, ignoring the remainder term and the term for $i = 0$, we get

$$\sum_{i=0}^{n-1} E_P[D(M_{\mu^*} \| M_{\hat\mu_i})]
\approx \sum_{i=1}^{n-1} \frac{1}{2} \frac{E_P[(\hat\mu_i - \mu^*)^2]}{\mathrm{var}_{M_{\mu^*}} X}
= \frac{1}{2} \frac{\mathrm{var}_P X}{\mathrm{var}_{M_{\mu^*}} X} \sum_{i=1}^{n-1} \frac{1}{i}
= \frac{1}{2} \frac{\mathrm{var}_P X}{\mathrm{var}_{M_{\mu^*}} X} \ln n + O(1). \qquad (9)$$

Here the first approximate equality follows by (7) and (8). The second follows because for
exponential families, the ML estimator $\hat\mu_i$ is just the empirical average
$i^{-1} \sum_{j=1}^{i} X_j$, so that
$E_P(\hat\mu_i - \mu^*)^2 = \mathrm{var}(i^{-1} \sum_{j=1}^{i} X_j) = i^{-1} \mathrm{var}_P X$.
Thus, Theorem 1 follows if we can show (a) that the left-hand side of (9) is equal to the relative
redundancy $\mathcal{R}_U(n)$ and (b) that, as $n \to \infty$, the remainder terms in (7), summed
over $n$ as in (9), form a convergent series (i.e. sum to something finite). Result (a) follows
relatively easily by rewriting the sum using the chain rule for relative entropy and using the
fact that $X$ is a sufficient statistic (see the lemmas of Section 8). The truly difficult part of
the proof is (b), which is also shown in a lemma of Section 8. It involves infinite sums of
expectations over unbounded fourth-order derivatives of the KL divergence. To make this work, we
(1) slightly modify the ML estimator by introducing the initial fake outcome $x_0$, and (2) impose
Condition 1. To understand it, consider the case $T = X$, $X$ unbounded from above. The condition
essentially expresses that, as $\hat\mu$ increases to infinity, the fourth order Taylor-term does
not grow too fast. Similarly, if $X$ is bounded from above by $g$, the condition ensures that the
fourth-order term grows slowly enough as $\hat\mu \uparrow g$. The same requirements are imposed
for decreasing $\hat\mu$.
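
The second-order expansion (7), with the Fisher information given by (8), can be checked
numerically for a concrete family. The sketch below does this for the Poisson family, where the
KL divergence has a closed form and the variance equals the mean; the function names are
hypothetical, and the example is only meant to show how quickly the quadratic term degrades as
$\mu$ moves away from $\mu^*$.

```python
import math


def kl_poisson(mu_star, mu):
    """D(M_mu* || M_mu) in nats for Poisson distributions with means mu_star and mu."""
    return mu - mu_star + mu_star * math.log(mu_star / mu)


def quadratic_approximation(mu_star, mu):
    """(1/2) * I(mu*) * (mu - mu*)^2, using I(mu*) = 1 / var = 1 / mu* for Poisson."""
    return 0.5 * (mu - mu_star) ** 2 / mu_star


mu_star = 4.0
for mu in (4.1, 4.5, 6.0):
    # Columns: mu, exact divergence, quadratic term of the Taylor expansion.
    print(mu, kl_poisson(mu_star, mu), quadratic_approximation(mu_star, mu))
```

The gap between the two columns is the remainder term of (7); the proof has to control the sum of
its expectations over all $i$.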

4 Redundancy vs. Regret 

The 'goodness' of a universal code relative to a model $\mathcal{M}$ can be measured in several
ways: rather than using redundancy (as we did here), one can also choose to measure codelength
differences in terms of regret, where one has a further choice between expected regret and
worst-case regret [?]. Here we only discuss the implications of our result for the expected regret
measure.

Let $\mathcal{M} = \{M_\theta \mid \theta \in \Theta\}$ be a family of distributions parameterized
by $\Theta$. Given a sequence $z^n = z_1, \ldots, z_n$ and a universal code $U$ for $\mathcal{M}$
with lengths $L_U$, the regret of $U$ on sequence $z^n$ is defined as

$$L_U(z^n) - \inf_{\theta \in \Theta} [-\ln M_\theta(z^n)]. \qquad (10)$$

Note that if the (unmodified) ML estimator $\hat\theta(z^n)$ exists, then this is equal to
$L_U(z^n) + \ln M_{\hat\theta(z^n)}(z^n)$. Thus, one compares the codelength achieved on $z^n$ by
$U$ to the best possible that could have been achieved on that particular $z^n$, using any of the
distributions in $\mathcal{M}$. Assuming $Z_1, Z_2, \ldots$ are i.i.d. according to some
(arbitrary) $P$, one may now consider the expected regret

$$\hat{\mathcal{R}}_U(n) := E_P\left[L_U(Z^n) - \inf_{\theta \in \Theta} [-\ln M_\theta(Z^n)]\right].$$

To quantify the difference between redundancy and expected regret, consider the function

$$d(n) := \inf_{\theta \in \Theta} E_P[-\ln M_\theta(Z^n)] - E_P\left[\inf_{\theta \in \Theta} [-\ln M_\theta(Z^n)]\right],$$

and note that for any universal code, $\hat{\mathcal{R}}_U(n) - \mathcal{R}_U(n) = d(n)$. In case
$P \in \mathcal{M}$, then under regularity conditions on $\mathcal{M}$ and its parameterization,
it can be shown [?] that

$$\lim_{n \to \infty} d(n) = \frac{k}{2}, \qquad (11)$$

where $k$ is the dimension of $\mathcal{M}$. In our case, where $P$ is not necessarily in
$\mathcal{M}$, we have the following:

Theorem 2 Let $\mathcal{X}$ be finite. Let $P$, $M_\mu$ and $\mu^*$ be as in Theorem 1. Then

$$\lim_{n \to \infty} d(n) = \frac{1}{2} \frac{\mathrm{var}_P X}{\mathrm{var}_{M_{\mu^*}} X}. \qquad (12)$$

Since we are dealing with 1-parameter families, in the special case that $P \in \mathcal{M}$, this
result reduces to (11). We conjecture that, under a condition similar to Condition 1, the same
result still holds for general, not necessarily finite or countable or bounded $\mathcal{X}$, but
at the time of writing this submission we had not yet worked out the details. In any case, our
result is sufficient to show that in some cases (namely, if $\mathcal{X}$ is finite), we have

$$\hat{\mathcal{R}}_U(n) = \frac{1}{2} \frac{\mathrm{var}_P X}{\mathrm{var}_{M_{\mu^*}} X} \ln n + O(1),$$

so that, up to $O(1)$-terms, the redundancy and the regret of the prequential ML code behave
in the same way.
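
For a finite sample space the quantity $d(n)$ can be evaluated exactly. The sketch below does so
for the well-specified Bernoulli case, where (11) with $k = 1$ predicts a limit of 1/2; it relies
on the fact that the minimal codelength on a sequence with $j$ ones is $n$ times the binary entropy
of $j/n$. The function names and the choice $p = 0.3$ are illustrative assumptions.

```python
import math


def binary_entropy(q):
    """Entropy in nats of a Bernoulli(q) distribution, with 0 * ln 0 taken as 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log(q) - (1.0 - q) * math.log(1.0 - q)


def d_bernoulli(n, p):
    """Exact d(n) for the Bernoulli model when the data are i.i.d. Bernoulli(p), 0 < p < 1."""
    expected_min = 0.0
    for j in range(n + 1):
        # Log of the binomial probability of observing j ones in n outcomes.
        log_pmf = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                   + j * math.log(p) + (n - j) * math.log(1.0 - p))
        # inf over theta of -ln M_theta(z^n) equals n * H(j / n) for j ones.
        expected_min += math.exp(log_pmf) * n * binary_entropy(j / n)
    # inf over theta of E_P[-ln M_theta(Z^n)] equals n * H(p).
    return n * binary_entropy(p) - expected_min


for n in (10, 100, 2000):
    print(n, d_bernoulli(n, 0.3))
```

Because the Bernoulli model is complete, this only illustrates the well-specified limit (11); the
misspecified limit (12) would require a model that does not contain $P$.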

Incidentally, Theorem 2 can be used to substantiate the claim we made in Section 2, which
stated that the Bayes (equipped with a strictly positive differentiable prior), NML and 2-part
codes still achieve relative redundancy of $\frac{1}{2} \ln n$ if $P \notin \mathcal{M}$, at least
if $\mathcal{X}$ is finite. Let us informally explain why this is the case. It is easy to show
that Bayes, NML and (suitably defined) 2-part codes achieve regret $\frac{1}{2} \ln n + O(1)$ for
all sequences $z_1, z_2, \ldots$ such that $\hat\theta(z^n)$ is bounded away from the boundary of
the parameter space for all large $n$ [?]. It then follows using, for example, the Chernoff bound
that these codes must also achieve expected regret $\frac{1}{2} \ln n + O(1)$ for all
distributions $P$ on $\mathcal{X}$ that satisfy $E_P[X] = \mu^* \in \Theta_\mu$. Theorem 2
then shows that they also achieve relative redundancy $\frac{1}{2} \ln n + O(1)$ for all
distributions $P$ on $\mathcal{X}$ that satisfy $E_P[X] = \mu^* \in \Theta_\mu$. We omit further
details.

5 Variations of Prequential Coding

5.1 Justifying Our Modification of the ML Estimator

If the prequential code is based on the ordinary ML estimator ($n_0 = 0$ in Definition 2) then,
apart from being undefined for the first outcome, it may achieve infinite codelengths on the
observed data. A simple example is the Bernoulli model. If we first observe $z_1 = 0$ and then
$z_2 = 1$, the codelength of $z_2$ according to the ordinary ML estimator of $z_2$ given $z_1$
would be $-\ln M_{\hat\mu(z_1)}(z_2) = -\ln 0 = \infty$. There are several ways to resolve this
problem. We choose to add an 'initial fake outcome'. Another possibility that has been suggested
(e.g., [?]) is to use the ordinary ML estimator, but only start using it after having observed $m$
examples, where $m$ is the smallest number such that $-\ln M_{\hat\mu(z^m)}(Z_{m+1})$ is
guaranteed to be finite, no matter what value $Z_{m+1}$ is realized. The first $m$ outcomes may
then be encoded by repeatedly using some code $L_0$ on outcomes of $Z$, so that for $i \leq m$,
the codelength of $z_i$ does not depend on the outcomes $z^{i-1}$. In the Bernoulli example, one
could for example use the code corresponding to $P(Z_i = 1) = 1/2$, until and including the first
$i$ such that $z^i$ includes both a 0 and a 1. Then it takes $i$ bits to encode the first $i$
outcomes $z^i$, no matter what they are. After that, one uses the prequential code with the
standard ML estimator. It is easy to see (by slight modification of the proof) that our theorem
still holds for this variation of prequential coding. Thus, our particular choice for resolving
the startup problem is not crucial to obtain our result. The advantage of our solution is that, as
we now show, it allows us to interpret our modified ML estimator also as a Bayesian MAP and
Bayesian mean estimator, thereby showing that the same behavior can be expected for such
estimators.
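
The Bernoulli startup problem described above is easy to reproduce. In the sketch below the
helper names are hypothetical, and the fake outcome $x_0 = 0.5$ with multiplicity $n_0 = 1$ is an
assumed choice for illustration only; the text merely requires $x_0$ to lie in the interior of the
parameter space and $n_0 > 0$.

```python
import math


def bernoulli_neg_log_prob(z, theta):
    """-ln M_theta(z) in nats; returns infinity when the outcome has probability 0."""
    prob = theta if z == 1 else 1.0 - theta
    return math.inf if prob == 0.0 else -math.log(prob)


def modified_ml(zs, x0, n0):
    """Modified ML estimate (x0 * n0 + sum(zs)) / (len(zs) + n0).
    With n0 = 0 this is the ordinary ML estimate and needs len(zs) > 0."""
    return (x0 * n0 + sum(zs)) / (len(zs) + n0)


past = [0]   # z_1 = 0
z_next = 1   # z_2 = 1
# Ordinary ML estimate is 0, so the outcome 1 gets an infinite codelength.
print(bernoulli_neg_log_prob(z_next, modified_ml(past, x0=0.5, n0=0)))
# With the fake outcome the estimate is 0.25 and the codelength is finite.
print(bernoulli_neg_log_prob(z_next, modified_ml(past, x0=0.5, n0=1)))
```

Any interior $x_0$ keeps the estimate away from 0 and 1 after finitely many outcomes, which is
why the codelength of the next outcome stays finite.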

5.2 Prequential Models with Other Estimators 

The Bayesian MAP estimator If a conjugate prior is used, the Bayesian maximum a-posteriori
estimator can always be interpreted as an ML estimator based on the sample and
some additional 'fake data' ([3]; see also the notion of ESS (Equivalent Sample Size) Priors
discussed in, for example, [15]). Therefore, the prequential ML model as defined above can
also be interpreted as a prequential MAP model for that class of priors, and the whole analysis
carries over to that setting.

The Bayesian mean estimator It follows by the work of Hartigan [12, Chapter 7] on the so-called
'maximum likelihood prior', that by slightly modifying conjugate priors, we can construct
priors such that the Bayesian mean rather than MAP estimator is of the form of our modified
ML estimator.

A Conjecture In some special cases, for example, the Bernoulli model, the exponential family
$\mathcal{M}$ covers all distributions that can be defined on $\mathcal{X}$. In such cases, there
exists no distribution with mean $\mu^*$ and variance not equal to $\mathrm{var}_{M_{\mu^*}} X$,
and the $\frac{1}{2} \ln n + O(1)$ redundancy can always be achieved. But in all other cases,
three very reasonable and efficient [?] estimators (ML, Bayes MAP, Bayes mean for a large class of
reasonable priors) cannot achieve $\frac{1}{2} \ln n + O(1)$ in all circumstances. This suggests
that no matter what in-model$^1$ estimator is used, the prequential model cannot yield a relative
redundancy of $\frac{1}{2} \ln n$ independently of the variance of the data generating
distribution $P$.

$^1$ See Section 6.

5.3 Rissanen's Predictive MDL Approach

The MDL model selection criterion that is based on comparing the prequential ML codelengths
for the models under consideration is called the Predictive MDL (PMDL) criterion by Rissanen
[22]. It is closely related to the Predictive Least Squares (PLS) criterion [21] for regression
models; PMDL can be seen as an MDL justification for it. There has been some discussion on
how to use PMDL when the data are not ordered. The prequential ML codelength then becomes
redundant: the same data can be coded in any order, yielding different code words. Rissanen
suggests in [?] to use the permutation of the outcomes that minimizes the codelength. Under
such a regime, Theorem 1 is no longer applicable (since the outcomes are no longer i.i.d.);
however, Example 1 illustrates that circumstances in which the prequential ML codelength
and the NML codelength behave very differently remain, under any regime that amounts to
reordering the sample, including the one suggested by Rissanen.

6 Consequences 

Why are these results interesting? We listed three significant implications in Section 1, the
introduction to this paper. The first was evident from Theorem 1. Let us now discuss the
second and third in more detail.

Practical significance for Model Selection There exists a plethora of results showing that
in various contexts, if $P \in \mathcal{M}$, then the prequential ML code achieves optimal
redundancy (see Section 2, Related Work). These strongly suggest that it is a very good
alternative for (or at least approximation to) the NML or Bayesian codes in MDL model selection.
Indeed, quoting Rissanen [24]:

"If the encoder does not look ahead but instead computes the best parameter 
values from the past string, only, using an algorithm which the decoder knows, 
then no preamble is needed. The result is a predictive coding process, one which 
is quite different from the sum or integral formula in the stochastic complexity.$^2$
And it is only because of a certain inherent singularity in the process, as well as 
the somewhat restrictive requirement that the data must be ordered, that we do not 
consider the resulting predictive code length to provide another competing definition 
for the stochastic complexity, but rather regard it as an approximation." 

Our result however shows that the prequential ML code may behave quite differently from the 
NML and Bayes codes, thereby strengthening the conclusion that it should not be taken as 
a definition of stochastic complexity. Although there is only a significant difference if data 
are distributed according to some P Ai, the difference is nevertheless very relevant in an 
MDL model selection context with nonoverlapping models, even if one of the models under 
consideration does contain the 'true' P. To see this, suppose we are comparing two models 
Aii and Ai2 for the same data, and in fact, P £ Ai\ UM^ For concreteness, assume Ai\ 
is the Poisson family and Aii is the geometric family. We want to decide which of these two 
models best explains the data. According to the MDL Principle, we should associate with 
each model a universal code (preferably the NML code). We should then pick the model such 
that the corresponding universal codelength of the data is minimized. Now suppose we use 
the prequential ML codelengths rather than the NML codelengths. Without loss of generality 
suppose that P £ Ai\. Then P Ai2- This means that the codelength relative to Ai\ behaves 

2 The stochastic complexity is the codelength of the data zi,...,z n that can be achieved using the NML code. 
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essentially like the NML codelength, but the codelength relative to A\i behaves differently - 
at least as long as the variances do not match (which for example, is forcibly the case if Ai\ 
is Poisson and M2 is geometric). This introduces a bias in the model selection scheme. We 
have found experimentally |Hj that the error rate for model selection based on the prequential 
ML code decreases more slowly than when other universal codes are used. Even though in 
some cases the redundancy grows more slowly than ^ Inn, so that the prequential ML code 
is in a sense more efficient than the NML code, model selection based on the prequential ML 
codes behaves worse than Bayesian and NML-based model selection. We provide a theoretical 
explanation for this phenomenon in The practical relevance of this phenomenon stems from 
the fact that the prequential ML codelengths are often a lot easier to compute than the Bayes 
or NML codes, so that they are often used in applications |18[ lift]. 
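
The model selection procedure described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes the modified ML estimator $\hat\mu_i = (n_0 x_0 + \sum_{j \le i} x_j)/(i + n_0)$ with a hypothetical fake initial outcome `x0 = 1` of weight `n0 = 1`, and a geometric family on $\{0, 1, 2, \ldots\}$ parameterized by its mean. All function names are illustrative.

```python
import math

def modified_ml_mean(xs, x0=1.0, n0=1.0):
    """Modified ML estimate (n0*x0 + sum(xs)) / (len(xs) + n0).
    x0 and n0 are assumed values for the fake initial outcome."""
    return (n0 * x0 + sum(xs)) / (len(xs) + n0)

def poisson_logpmf(x, mu):
    """Log-probability of x under the Poisson distribution with mean mu."""
    return x * math.log(mu) - mu - math.lgamma(x + 1)

def geometric_logpmf(x, mu):
    """Log-probability of x under the geometric distribution on {0,1,2,...}
    with mean mu, i.e. P(x) = mu^x / (1 + mu)^(x + 1)."""
    return x * math.log(mu) - (x + 1) * math.log(1.0 + mu)

def prequential_codelength(xs, logpmf, x0=1.0, n0=1.0):
    """Codelength in nats: each outcome is encoded with the model element
    indexed by the modified ML estimate of all previous outcomes."""
    total = 0.0
    for i, x in enumerate(xs):
        mu_hat = modified_ml_mean(xs[:i], x0, n0)
        total -= logpmf(x, mu_hat)
    return total

def select_model(xs):
    """Pick the family with the shorter prequential ML codelength."""
    lengths = {
        "poisson": prequential_codelength(xs, poisson_logpmf),
        "geometric": prequential_codelength(xs, geometric_logpmf),
    }
    return min(lengths, key=lengths.get)
```

Because the fake initial outcome is positive, the estimate is always strictly positive, so both log-probabilities are well defined even for the first outcome.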

Theoretical Significance The result is also of theoretical-statistical interest: our theorem
can be re-interpreted as establishing bounds on the asymptotic Kullback-Leibler risk of density
estimation using ML and Bayes estimators under misspecification ($P \notin \mathcal{M}$). Our result implies
that, under misspecification, the KL risk of estimators such as ML, which are required to lie in
the model $\mathcal{M}$, behaves in a fundamentally different way from the KL risk of estimators such as
the Bayes predictive distribution, which are not restricted to lie in $\mathcal{M}$. Namely, we can think
of every universal model $U$ defined as a random process on infinite sequences as an estimator
in the following way: define, for all $n$,

$$P_n := \Pr(Z_{n+1} = \cdot \mid Z_1 = z_1, \ldots, Z_n = z_n),$$

a function of the sample $z_1, \ldots, z_n$. $P_n$ can be thought of as the 'estimate of the true data
generating distribution upon observing $z_1, \ldots, z_n$'. In case $U$ is the prequential ML model,
$P_n = M_{\hat\mu_n}$ is simply our modified ML estimator. It is now important to note that for other
universal models, $P_n$ is not required to lie in $\mathcal{M}$. An example is the Bayesian universal code
defined relative to some prior $w$. This code has lengths $L'(z^n) := -\ln \int M_\mu(z^n) w(\mu)\, d\mu$.
The corresponding estimator is the Bayesian posterior predictive distribution
$P_{\text{Bayes}}(z_{i+1} \mid z^i) := \int M_\mu(z_{i+1}) w(\mu \mid z^i)\, d\mu$. The Bayesian predictive distribution is a mixture of elements of $\mathcal{M}$.
We will call standard estimators like the ML estimator, which are required to lie in $\mathcal{M}$, in-model
estimators. Estimators like the Bayesian predictive distribution will be called out-model.
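
To make the in-model/out-model distinction concrete, the sketch below computes a Bayesian predictive distribution for the Poisson family. The Gamma prior with shape `a` and rate `b` is an assumption chosen only because it is conjugate; the text does not prescribe a prior. Under this assumption the predictive distribution is negative binomial, whose variance exceeds its mean, so it is not itself a Poisson distribution and the estimator is out-model.

```python
import math

def bayes_predictive_logpmf(x, data, a=1.0, b=1.0):
    """Log posterior-predictive probability of outcome x, for a Poisson
    likelihood and an (assumed) Gamma prior with shape a and rate b.
    The posterior is Gamma(a + sum(data), b + len(data)); integrating the
    Poisson mass function against it gives a negative binomial."""
    a_post = a + sum(data)
    b_post = b + len(data)
    return (math.lgamma(x + a_post) - math.lgamma(a_post) - math.lgamma(x + 1)
            + a_post * math.log(b_post / (b_post + 1.0))
            - x * math.log(b_post + 1.0))

def predictive_mean_and_variance(data, a=1.0, b=1.0, support=300):
    """Mean and variance of the predictive distribution, by direct summation
    over a truncated support (the tail decays geometrically)."""
    pmf = [math.exp(bayes_predictive_logpmf(x, data, a, b)) for x in range(support)]
    mean = sum(x * p for x, p in enumerate(pmf))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))
    return mean, var
```

An in-model estimator for the same family would instead return a single Poisson distribution, for which mean and variance coincide.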

Let now $P_n$ be any estimator, in-model or out-model. Let $P_{z^n}$ be the distribution estimated
for a particular realized sample $z^n$. We can measure the closeness of $P_{z^n}$ to $M_{\mu^*}$, the distribution
in $\mathcal{M}$ closest to $P$ in KL-divergence, by considering the extended KL divergence

$$D^*(M_{\mu^*} \| P_{z^n}) = E_{Z \sim P}\left[-\ln P_{z^n}(Z) - [-\ln M_{\mu^*}(Z)]\right].$$

We can now consider the expected KL divergence between $M_{\mu^*}$ and $P_n$ after observing a sample
of length $n$:

$$E_{Z_1, \ldots, Z_n \sim P}\left[D^*(M_{\mu^*} \| P_n)\right]. \qquad (13)$$

In analogy to the definition of 'ordinary' KL risk [2j, we call (13) the extended KL risk. We
recognize $\mathcal{R}_U(n)$, the redundancy of the prequential ML model, as the accumulated expected KL
risk of our modified ML estimator (see Proposition ^1 and Lemma 9). In exactly the same way
as for the prequential ML code, the redundancy of the Bayesian code can be re-interpreted as
the accumulated KL risk of the Bayesian predictive distribution. With this interpretation, our
Theorem 1 expresses that under misspecification, the cumulative KL risk of the ML estimator
differs from the cumulative KL risk of the Bayes estimator by a term of $O(\ln n)$. If our conjecture
that no in-model estimator can achieve redundancy $\frac{1}{2}\ln n + O(1)$ for all $\mu^*$ and all $P$ with finite
variance is true (Section 5.2), then it follows that the KL risk for in-model estimators behaves
in a fundamentally different way from the KL risk for out-model estimators, and that out-model
estimators are needed to achieve the optimal constant $c = 1$ in the redundancy $\frac{1}{2} c \ln n + O(1)$.



7 Proof of Theorem 1



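Preliminaries Note that, for any $M_\mu, M_{\mu'} \in \mathcal{M}$ we have

$$\begin{aligned}
E_P[-\ln M_\mu(Z)] - E_P[-\ln M_{\mu'}(Z)]
&= \eta(\mu) E_P[X(Z)] + \ln Z(\eta(\mu)) + E_P[-\ln h(Z)] \\
&\quad - \eta(\mu') E_P[X(Z)] - \ln Z(\eta(\mu')) - E_P[-\ln h(Z)] \\
&= E_P[-\ln M_\mu(X)] - E_P[-\ln M_{\mu'}(X)],
\end{aligned}$$

so that we have

Proposition 3

$$\mathcal{R}_U(n) = E_P[-\ln U(X^n)] - \inf_\mu E_P[-\ln M_\mu(X^n)].$$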



Proposition 3 shows that relative redundancy, which is the sole quantity of interest in the proof,
depends only on the value of $X$, not $Z$. Thus, in the proof of Theorem 1 as well as all the
Lemmas and Propositions it makes use of, we will never mention $Z$ again. Whenever we refer
to a 'distribution' we mean a distribution of random variable $X$, and we also think of the data
generating distribution $P$ in terms of the distribution it induces on $X$ rather than $Z$. Whenever
we say 'the mean' without further qualification, we refer to the mean of the random variable
$X$. Whenever we refer to the Kullback-Leibler (KL) divergence between $P$ and $Q$, we refer to
the KL divergence between the distributions they induce for $X$ (the reader who is confused by
this may simply restrict attention to exponential family models for which $Z = X$, and consider
$X$ and $Z$ identical).

The proof refers to a number of theorems and lemmas which will be developed in Section 8.
In the statement of all these results, we assume, as in the statement of Theorem 1, that
$X, X_1, X_2, \ldots$ are i.i.d. $\sim P$ and that $\mu^*$ is the mean of $X$ under $P$. If $X$ takes its values in a
countable set, then all integrals in the proof should be read as the corresponding sums.

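Proof (of Theorem 1) From Lemma 9 we have:

$$\mathcal{R}_U(n) = \sum_{i=0}^{n-1} E_{\hat\mu_i \sim P}\left[D(M_{\mu^*} \| M_{\hat\mu_i})\right]. \qquad (14)$$

Here, $\hat\mu_i$ is a random variable that takes on values according to $P$, while $\mu^*$ is fixed. We first
abbreviate $\delta_i = \hat\mu_i - \mu^*$ and $\frac{d^k}{d\mu^k} D(M_{\mu^*} \| M_\mu) = D^{(k)}(\mu)$. That is, $D^{(k)}(\mu)$ is the $k$-th derivative
of the function $f(\mu) := D(M_{\mu^*} \| M_\mu)$. We now Taylor-expand the divergence around $\mu^*$:

$$D(M_{\mu^*} \| M_{\hat\mu_i}) = 0 + \delta_i D^{(1)}(\mu^*) + \frac{\delta_i^2}{2} D^{(2)}(\mu^*) + \frac{\delta_i^3}{6} D^{(3)}(\mu^*) + \frac{\delta_i^4}{24} D^{(4)}(\ddot\mu).$$

The last term is the remainder term of the Taylor expansion, in which $\ddot\mu$ lies between $\mu^*$ and $\hat\mu_i$.
The second term vanishes, because the divergence is minimized at $\mu = \mu^*$, so that $D^{(1)}(\mu^*) = 0$.
The third term involves the second derivative $D^{(2)}(\mu)$,
which, evaluated at $\mu^*$, resembles the Fisher information. Fisher information is usually defined
as $I(\theta) := E\left[\left(\frac{d}{d\theta} \ln f(X \mid \theta)\right)^2\right]$, but as is well known [14], for exponential families this is equal
to $-\frac{d^2}{d\theta^2} E[\ln f(X \mid \theta)]$, which matches $D^{(2)}(\cdot)$ exactly. Combining this with (jHJ) (Sectional), we
obtain:

$$D(M_{\mu^*} \| M_{\hat\mu_i}) = \frac{1}{2}\delta_i^2 / \mathrm{var}_{M_{\mu^*}}(X) + \frac{1}{6}\delta_i^3 D^{(3)}(\mu^*) + \frac{1}{24}\delta_i^4 D^{(4)}(\ddot\mu). \qquad (15)$$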
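We plug this expression back into Equation (14), giving

$$\mathcal{R}_U(n) = \frac{1}{2\,\mathrm{var}_{M_{\mu^*}}(X)} \sum_{i=0}^{n-1} E_{\hat\mu_i \sim P}\left[\delta_i^2\right] + R(n), \qquad (16)$$

where the remainder term $R(n)$ is given by

$$R(n) = \sum_{i=0}^{n-1} E_{\hat\mu_i \sim P}\left[\frac{1}{6}\delta_i^3 D^{(3)}(\mu^*) + \frac{1}{24}\delta_i^4 D^{(4)}(\ddot\mu)\right], \qquad (17)$$

where $\ddot\mu$ and $\delta_i$ are random variables depending on $\hat\mu_i$ and $i$. In Lemma 11 we show that
$R(n) = O(1)$, giving:

$$\mathcal{R}_U(n) = O(1) + \frac{1}{2\,\mathrm{var}_{M_{\mu^*}}(X)} \sum_{i=0}^{n-1} E_{\hat\mu_i \sim P}\left[(\hat\mu_i - \mu^*)^2\right]. \qquad (18)$$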

Note that $\hat\mu_i$ is almost the ML estimator. This suggests that each term in the sum of (18)
should be almost equal to the variance of the ML estimator, which is $\mathrm{var} X / i$. Because of the
slight modification that we made to the estimator, we get a correction term of $O((i+1)^{-2})$ as
established in Theorem 5. This theorem gives:

$$\sum_{i=0}^{n-1} E_{\hat\mu_i \sim P}\left[(\hat\mu_i - \mu^*)^2\right] = \sum_{i=0}^{n-1} O\left((i+1)^{-2}\right) + \mathrm{var}_P(X) \sum_{i=0}^{n-1} (i+1)^{-1} = O(1) + \mathrm{var}_P(X) \ln n. \qquad (19)$$

The combination of (18) and (19) completes the proof. □
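
As a rough numerical illustration of the final step of the proof, the sketch below evaluates only the leading sum $\frac{1}{2\,\mathrm{var}_{M_{\mu^*}}(X)} \sum_{i=0}^{n-1} \mathrm{var}_P(X)/(i+1)$ and compares it with the rate $\frac{1}{2}\frac{\mathrm{var}_P X}{\mathrm{var}_{M_{\mu^*}} X}\ln n$. It deliberately ignores the $O((i+1)^{-2})$ corrections and the remainder $R(n)$, which contribute only $O(1)$; the function names are illustrative.

```python
import math

def leading_redundancy_sum(n, var_p, var_m):
    """(1 / (2 var_M)) * sum_{i=0}^{n-1} var_P / (i + 1): the dominant part
    of the redundancy, without the O(1) correction terms."""
    return sum(var_p / (i + 1) for i in range(n)) / (2.0 * var_m)

def asymptotic_rate(n, var_p, var_m):
    """The rate (1/2) * (var_P / var_M) * ln n stated by the theorem."""
    return 0.5 * (var_p / var_m) * math.log(n)

def redundancy_constant(var_p, var_m):
    """The constant c = var_P / var_M multiplying (1/2) ln n."""
    return var_p / var_m
```

For example, with Poisson data of mean 2 (variance 2) and the geometric model element of mean 2 (variance 6), the constant is $c = 1/3$; with the roles of the two families exchanged it is $c = 3$. The gap between the two functions stays bounded because the harmonic sum differs from $\ln n$ by at most 1.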



8 Building blocks of the proof 

The proof of Theorem 1 is based on Lemma 9 and Lemma 11. These lemmas are stated and
proved in Sections 8.2 and 8.3, respectively. The proofs of Theorem 1 and Theorem 2, as well as
the proofs of both lemmas, are based on a number of generally useful results about probabilities
and expectations of deviations between the average and the mean of a random variable. Below,
in Section 8.1, we first list these deviation-related results.



8.1 Results about Deviations between Average and Mean 



Lemma 4 Let $X, X_1, X_2, \ldots$ be i.i.d. with mean 0. Then we have $E\left[\left(\sum_{i=1}^n X_i\right)^2\right] = n\,\mathrm{var}(X)$.

Proof For $n = 0$ the lemma is obviously true. Suppose it is true for some $n$. For brevity we
write $s_n = \sum_{i=1}^n X_i$. Because the mean is zero, we have $E[s_n] = \sum_{i=1}^n E[X] = 0$. Now we compute

$$E\left[s_{n+1}^2\right] = E\left[(s_n + X)^2\right] = E\left[s_n^2\right] + 2E[s_n]E[X] + E[X^2] = (n+1)\,\mathrm{var}(X).$$

The proof follows by induction. □
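
Lemma 4 can be checked exactly on a small example by enumerating all outcome sequences with rational arithmetic. The distribution used in the usage note below is an arbitrary illustrative choice with mean 0; it is not taken from the paper.

```python
from fractions import Fraction
from itertools import product

def second_moment_of_sum(values, probs, n):
    """Exact E[(X_1 + ... + X_n)^2] for i.i.d. outcomes, obtained by
    enumerating all len(values)**n sequences of outcomes."""
    total = Fraction(0)
    for indices in product(range(len(values)), repeat=n):
        prob = Fraction(1)
        s = Fraction(0)
        for j in indices:
            prob *= probs[j]
            s += values[j]
        total += prob * s * s
    return total
```

For instance, with values $-1, 0, 2$ taken with probabilities $1/2, 1/4, 1/4$ the mean is 0 and the variance is $3/2$, so the function should return $3n/2$ for every $n$.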
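Theorem 5 Let $X, X_1, \ldots$ be i.i.d. random variables, define $\hat\mu_n := (n_0 \cdot x_0 + \sum_{i=1}^n X_i)/(n + n_0)$
and $\mu^* = E[X]$. If $\mathrm{var} X < \infty$, then $E[(\hat\mu_n - \mu^*)^2] = O((n+1)^{-2}) + \mathrm{var}(X)/(n+1)$.

Proof We define $Y_i := X_i - \mu^*$; this can be seen as a new sequence of i.i.d. random variables
with mean 0 and $\mathrm{var} Y = \mathrm{var} X$. We also set $y_0 := x_0 - \mu^*$. Now we have:

$$\begin{aligned}
E\left[(\hat\mu_n - \mu^*)^2\right]
&= E\left[\left(n_0 \cdot y_0 + \sum_{i=1}^n Y_i\right)^2\right] (n + n_0)^{-2} \\
&= E\left[(n_0 \cdot y_0)^2 + 2 n_0 \cdot y_0 \sum_{i=1}^n Y_i + \left(\sum_{i=1}^n Y_i\right)^2\right] (n + n_0)^{-2} \\
&= O\left((n+1)^{-2}\right) + E\left[\left(\sum_{i=1}^n Y_i\right)^2\right] (n + n_0)^{-2} \\
&\overset{(*)}{=} O\left((n+1)^{-2}\right) + n\,\mathrm{var}(Y)(n + n_0)^{-2} \\
&= O\left((n+1)^{-2}\right) + \mathrm{var}(X)/(n+1),
\end{aligned}$$

where (*) follows by Lemma 4. □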
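
The computation in the proof of Theorem 5 yields, before the big-O simplification, the exact identity $E[(\hat\mu_n - \mu^*)^2] = \left((n_0 y_0)^2 + n\,\mathrm{var}(X)\right)/(n + n_0)^2$. The sketch below compares this closed form with a brute-force enumeration for Bernoulli outcomes. The Bernoulli choice and the particular values of `x0` and `n0` in the usage note are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def mse_formula(n, var_x, mu_star, x0, n0):
    """Closed form ((n0*y0)^2 + n*var(X)) / (n + n0)^2 with y0 = x0 - mu*."""
    y0 = x0 - mu_star
    return ((n0 * y0) ** 2 + n * var_x) / Fraction(n + n0) ** 2

def mse_enumerated(n, p, x0, n0):
    """Brute-force E[(mu_hat_n - p)^2] for n i.i.d. Bernoulli(p) outcomes,
    where mu_hat_n = (n0*x0 + sum of outcomes) / (n + n0)."""
    total = Fraction(0)
    for xs in product((0, 1), repeat=n):
        prob = Fraction(1)
        for x in xs:
            prob *= p if x == 1 else 1 - p
        mu_hat = (n0 * x0 + sum(xs)) / Fraction(n + n0)
        total += prob * (mu_hat - p) ** 2
    return total
```

With, say, `p = Fraction(1, 3)`, `x0 = Fraction(1, 2)` and `n0 = 1`, the two functions agree exactly for every sample size small enough to enumerate.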

The following theorem is of some independent interest. 

Theorem 6 Suppose $X, X_1, X_2, \ldots$ are i.i.d. with mean 0. If the first $k \in \mathbb{N}$ moments of $X$
exist, then we have $E\left[\left(\sum_{i=1}^n X_i\right)^k\right] = O\left(n^{\lfloor k/2 \rfloor}\right)$.

Remark It follows as a special case of Theorem 2 of |H] that $E\left[\left|\sum_{i=1}^n X_i\right|^k\right] = O\left(n^{k/2}\right)$, which
almost proves this theorem and which would in fact be sufficient for our purposes. We use this
theorem instead, which has an elementary proof.



Proof We have:

$$E\left[\left(\sum_{i=1}^n X_i\right)^k\right] = E\left[\sum_{i_1=1}^n \cdots \sum_{i_k=1}^n X_{i_1} \cdots X_{i_k}\right] = \sum_{i_1=1}^n \cdots \sum_{i_k=1}^n E\left[X_{i_1} \cdots X_{i_k}\right].$$

We define the frequency sequence of a term to be the sequence of exponents of the different
random variables in the term, in decreasing order. For a frequency sequence $f_1, \ldots, f_m$, we
have $\sum_{i=1}^m f_i = k$. Furthermore, using independence of the different random variables, we can
rewrite $E[X_{i_1} \cdots X_{i_k}] = \prod_{i=1}^m E[X^{f_i}]$, so the value of each term is determined by its frequency
sequence. By computing the number of terms that share a particular frequency sequence, we
obtain:

$$E\left[\left(\sum_{i=1}^n X_i\right)^k\right] = \sum_{f_1 + \ldots + f_m = k} \binom{n}{m} \binom{k}{f_1, \ldots, f_m} \prod_{i=1}^m E\left[X^{f_i}\right].$$

To determine the asymptotic behavior, first observe that the frequency sequence $f_1, \ldots, f_m$
of which the contribution grows fastest in $n$ is the longest sequence, since for that sequence
the value of $\binom{n}{m}$ is maximized as $n \to \infty$. However, since the mean is zero, we can discard
all sequences with an element 1, because for those sequences we have $\prod_{i=1}^m E[X^{f_i}] = 0$,
so they contribute nothing to the expectation. Under this constraint, we obtain the longest
sequence for even $k$ by setting $f_i = 2$ for all $1 \le i \le m$; for odd $k$ by setting $f_1 = 3$ and
$f_i = 2$ for all $2 \le i \le m$; in both cases we have $m = \lfloor k/2 \rfloor$. The number of terms grows as
$\binom{n}{m} \le n^m / m! = O(n^m)$; for $m = \lfloor k/2 \rfloor$ we obtain the upper bound $O\left(n^{\lfloor k/2 \rfloor}\right)$. The number of
frequency sequences is finite and does not depend on $n$; since the contribution of each one is
$O\left(n^{\lfloor k/2 \rfloor}\right)$, so must be the sum. □
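
The growth rates claimed by Theorem 6 can be observed exactly on a small example. The sketch below builds the distribution of the sum by repeated convolution, so no sampling is involved. The two-point distribution in the usage note is an illustrative assumption: values $-1$ and $2$ with probabilities $2/3$ and $1/3$, which has mean 0, $E[X^2] = 2$, $E[X^3] = 2$ and $E[X^4] = 6$.

```python
from fractions import Fraction

def sum_distribution(values, probs, n):
    """Exact distribution of X_1 + ... + X_n as a dict {value: probability},
    built by convolving the single-outcome distribution n times."""
    dist = {Fraction(0): Fraction(1)}
    for _ in range(n):
        new_dist = {}
        for s, ps in dist.items():
            for v, pv in zip(values, probs):
                new_dist[s + v] = new_dist.get(s + v, Fraction(0)) + ps * pv
        dist = new_dist
    return dist

def moment_of_sum(values, probs, n, k):
    """Exact k-th moment E[(X_1 + ... + X_n)^k]."""
    return sum(p * s ** k for s, p in sum_distribution(values, probs, n).items())
```

For this distribution the third moment of the sum equals $2n$, of order $n^{\lfloor 3/2 \rfloor} = n$, and the fourth moment equals $12n^2 - 6n$, of order $n^{\lfloor 4/2 \rfloor} = n^2$, in line with the theorem.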



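Theorem 7 Let $X, X_1, \ldots$ be i.i.d. random variables, define $\hat\mu_n := (n_0 \cdot x_0 + \sum_{i=1}^n X_i)/(n + n_0)$
and $\mu^* = E[X]$. If the first $k$ moments of $X$ exist, then $E[(\hat\mu_n - \mu^*)^k] = O\left(n^{-\lceil k/2 \rceil}\right)$.

Proof The proof is similar to the proof for Theorem 5. We define $Y_i := X_i - \mu^*$; this can be
seen as a new sequence of i.i.d. random variables with mean 0, and $y_0 := x_0 - \mu^*$. Now we have:

$$\begin{aligned}
E\left[(\hat\mu_n - \mu^*)^k\right]
&= E\left[\left(n_0 \cdot y_0 + \sum_{i=1}^n Y_i\right)^k\right] (n + n_0)^{-k} \\
&= O\left(n^{-k}\right) \sum_{p=0}^k \binom{k}{p} (n_0 \cdot y_0)^p\, E\left[\left(\sum_{i=1}^n Y_i\right)^{k-p}\right] \\
&= O\left(n^{-k}\right) \sum_{p=0}^k \binom{k}{p} (n_0 \cdot y_0)^p \cdot O\left(n^{\lfloor (k-p)/2 \rfloor}\right).
\end{aligned}$$

In the last step we used Theorem 6 to bound the expectation. We sum $k + 1$ terms of which
the term for $p = 0$ grows fastest in $n$, so the expression is $O\left(n^{-\lceil k/2 \rceil}\right)$ as required. □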

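Theorem 7 concerns the expectation of the deviation of $\hat\mu_n$. We also need a bound on the
probability of large deviations. To do that we have the following separate theorem:

Theorem 8 Let $X, X_1, \ldots$ be i.i.d. random variables, define $\hat\mu_n := (n_0 \cdot x_0 + \sum_{i=1}^n X_i)/(n + n_0)$
and $\mu^* = E[X]$. Let $k \in \{0, 2, 4, \ldots\}$. If the first $k$ moments exist then
$P(|\hat\mu_n - \mu^*| \ge \delta) = O\left(n^{-k/2}\delta^{-k}\right)$.

Proof

$$\begin{aligned}
P(|\hat\mu_n - \mu^*| \ge \delta) &= P\left((\hat\mu_n - \mu^*)^k \ge \delta^k\right) \\
&\le E\left[(\hat\mu_n - \mu^*)^k\right] \delta^{-k} && \text{(by Markov's inequality)} \\
&= O\left(n^{-k/2}\delta^{-k}\right) && \text{(by Theorem 7)}
\end{aligned}$$

□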
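
The Markov step in this proof can be inspected numerically. The sketch below computes, for Bernoulli outcomes, both the exact tail probability of the modified estimator and the bound $E[(\hat\mu_n - \mu^*)^k]\,\delta^{-k}$, using the binomial distribution of the number of ones. The Bernoulli model and the default values of `x0` and `n0` are illustrative assumptions, not taken from the paper.

```python
import math

def tail_and_markov_bound(n, p, delta, k, x0=0.5, n0=1.0):
    """Return (P(|mu_hat_n - p| >= delta), E[(mu_hat_n - p)^k] / delta^k)
    for n i.i.d. Bernoulli(p) outcomes and even k, where
    mu_hat_n = (n0*x0 + number of ones) / (n + n0)."""
    tail = 0.0
    kth_moment = 0.0
    for s in range(n + 1):
        # Binomial probability that exactly s of the n outcomes equal one.
        prob = math.comb(n, s) * p ** s * (1.0 - p) ** (n - s)
        deviation = (n0 * x0 + s) / (n + n0) - p
        kth_moment += prob * deviation ** k
        if abs(deviation) >= delta:
            tail += prob
    return tail, kth_moment / delta ** k
```

For even $k$ the bound dominates the tail probability for every $n$; larger $k$ typically gives a tighter bound at large $n$, which is what the proof of Theorem 2 exploits by taking $k = 8$.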



8.2 Lemma 9: Redundancy for Exponential Families

Lemma 9 Let $U$ be a prequential ML model and $\mathcal{M}$ be an exponential family as in Theorem 1.
We have

$$\mathcal{R}_U(n) = \sum_{i=0}^{n-1} E_{\hat\mu_i \sim P}\left[D(M_{\mu^*} \| M_{\hat\mu_i})\right].$$

(Here, the notation $\hat\mu_i \sim P$ means that we take the expectation with respect to $P$ over data
sequences of length $i$, of which $\hat\mu_i$ is a function.)
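Proof We have:

$$\arg\inf_\mu E_P\left[-\ln M_\mu(X^n)\right] = \arg\inf_\mu E_P\left[\ln \frac{M_{\mu^*}(X^n)}{M_\mu(X^n)}\right] = \arg\inf_\mu D(M_{\mu^*} \| M_\mu).$$

In the last step we used Proposition 10 below. The divergence is minimized when $\mu = \mu^*$ [14],
so we find that:

$$\begin{aligned}
\mathcal{R}_U(n) &= E_P[-\ln U(X^n)] - E_P[-\ln M_{\mu^*}(X^n)] = E_P\left[\ln \frac{M_{\mu^*}(X^n)}{U(X^n)}\right] \\
&= E_P\left[\sum_{i=0}^{n-1} \ln \frac{M_{\mu^*}(X_{i+1})}{M_{\hat\mu_i}(X_{i+1})}\right] = \sum_{i=0}^{n-1} E_P\left[\ln \frac{M_{\mu^*}(X_{i+1})}{M_{\hat\mu_i}(X_{i+1})}\right] \\
&= \sum_{i=0}^{n-1} E_{\hat\mu_i \sim P}\left[D(M_{\mu^*} \| M_{\hat\mu_i})\right]. \qquad (20)
\end{aligned}$$

Here, the last step again follows from Proposition 10. □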



Proposition 10 Let $X \sim P$ with mean $\mu^*$, and let $M_\mu$ index an exponential family with
sufficient statistic $X$, so that $M_{\mu^*}$ exists. We have:

$$E_P\left[\ln \frac{M_{\mu^*}(X)}{M_\theta(X)}\right] = D(M_{\mu^*} \| M_\theta).$$

Proof Let $\eta(\cdot)$ denote the function mapping parameters in the mean value parameterization
to the natural parameterization. (It is the inverse of the function $\mu(\cdot)$ which was introduced in
the discussion of exponential families.) By working out both sides of the equation we find that
they both reduce to:

$$\eta(\mu^*)\mu^* + \ln Z(\eta(\mu^*)) - \eta(\theta)\mu^* - \ln Z(\eta(\theta)).$$

□
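
Proposition 10 can be checked numerically for a concrete family. The sketch below uses the Poisson family, for which $D(M_{\mu^*} \| M_\theta) = \mu^* \ln(\mu^*/\theta) - \mu^* + \theta$, and a data generating distribution $P$ with finite support that is not itself Poisson. The choice of family and of $P$ is an illustrative assumption.

```python
import math

def poisson_logpmf(x, mu):
    """Log-probability of x under the Poisson distribution with mean mu."""
    return x * math.log(mu) - mu - math.lgamma(x + 1)

def poisson_kl(mu_star, theta):
    """KL divergence D(Poisson(mu_star) || Poisson(theta))."""
    return mu_star * math.log(mu_star / theta) - mu_star + theta

def expected_log_ratio(values, probs, theta):
    """E_P[ln M_{mu*}(X) / M_theta(X)] for the Poisson family, where P puts
    probability probs[j] on values[j] and mu* is the mean of X under P."""
    mu_star = sum(x * p for x, p in zip(values, probs))
    return sum(p * (poisson_logpmf(x, mu_star) - poisson_logpmf(x, theta))
               for x, p in zip(values, probs))
```

With $P$ uniform on $\{0, 1, 2, 3, 4\}$ (mean 2) the expected log-ratio matches `poisson_kl(2.0, theta)` for any `theta > 0`, even though $P$ lies outside the model: only the mean of $P$ enters.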



8.3 Lemma 11: Convergence of the sum of the remainder terms

Lemma 11 Let $R(n)$ be defined as in (17). Then

$$R(n) = O(1).$$

Proof We omit irrelevant constants and the term for the first outcome, which is well-defined
because of our modification of the ML estimator. We abbreviate $\frac{d^k}{d\mu^k} D(M_{\mu^*} \| M_\mu) = D^{(k)}(\mu)$ as
in the proof of Theorem 1. First we consider the third order term. We write $E_{\delta_i \sim P}$ to indicate
that we take the expectation over data which is distributed according to $P$, of which $\delta_i$ is a
function. We use Theorem 7 to bound the expectation of $\delta_i^3$; under the condition that the first
three moments exist, which is assumed to be the case, we obtain:

$$\sum_{i=1}^{n-1} E_{\delta_i \sim P}\left[\delta_i^3 D^{(3)}(\mu^*)\right] = D^{(3)}(\mu^*) \sum_{i=1}^{n-1} E\left[\delta_i^3\right] = D^{(3)}(\mu^*) \sum_{i=1}^{n-1} O\left(i^{-2}\right) = O(1).$$

(The constants implicit in the big-ohs are the same across terms.)

The fourth order term is more involved, because $D^{(4)}(\ddot\mu)$ is not necessarily constant across
terms. To compute it we first distinguish a number of regions in the value space of $\delta_i$: let
$A_- = (-\infty, 0)$ and let $A_0 = [0, a)$ for some constant value $a > 0$. If the individual outcomes
$X$ are bounded on the right hand side by a value $g$ then we require that $a < g$ and we define
$A_1 = [a, g)$; otherwise we define $A_j = [a + j - 1, a + j)$ for $j \ge 1$. Now we must establish
convergence of:

$$\sum_{i=1}^{n-1} E_{\delta_i \sim P}\left[\delta_i^4 D^{(4)}(\ddot\mu)\right] = \sum_{i=1}^{n-1} \sum_j P(\delta_i \in A_j)\, E_{\delta_i \sim P}\left[\delta_i^4 D^{(4)}(\ddot\mu) \mid \delta_i \in A_j\right].$$

If we can establish that the sum converges for all regions $A_j$ for $j \ge 0$, then we can use
a symmetrical argument to establish convergence for $A_-$ as well, so it suffices if we restrict
ourselves to $j \ge 0$. First we show convergence for $A_0$. In this case, the basic idea is that since
the remainder $D^{(4)}(\ddot\mu)$ is well-defined over the interval $\mu^* \le \ddot\mu < \mu^* + a$, we can bound it by its
extremum on that interval, namely $m := \sup_{\ddot\mu \in [\mu^*, \mu^* + a)} |D^{(4)}(\ddot\mu)|$. Now we get:

$$\left|\sum_{i=1}^{n-1} P(\delta_i \in A_0)\, E\left[\delta_i^4 D^{(4)}(\ddot\mu) \mid \delta_i \in A_0\right]\right| \le m \sum_{i=1}^{n-1} P(\delta_i \in A_0)\, E\left[\delta_i^4 \mid \delta_i \in A_0\right] \le m \sum_i E\left[\delta_i^4\right].$$

Using Theorem 7 we find that $E[\delta_i^4]$ is $O(i^{-2})$, of which the sum converges. Theorem 7 requires
that the first four moments of $P$ exist, but this is guaranteed to be the case: either the outcomes
are bounded from both sides, in which case all moments necessarily exist, or the existence of
the required moments is part of the condition on the main theorem.

Now we have to distinguish between the unbounded and bounded cases. First we assume
that the $X$ are unbounded from above. In this case, we must show convergence of:

$$\sum_{i=1}^{n-1} \sum_{j=1}^{\infty} P(\delta_i \in A_j)\, E\left[\delta_i^4 D^{(4)}(\ddot\mu) \mid \delta_i \in A_j\right].$$

We bound this expression from above. The $\delta_i$ in the expectation is at most $a + j$. Furthermore
$D^{(4)}(\ddot\mu) = O(\ddot\mu^{k-6})$ by assumption on the main theorem, where $\ddot\mu \in [a + j - 1, a + j)$. Depending
on $k$, both boundaries could maximize this function, but it is easy to check that in both cases
the resulting function is $O(j^{k-6})$. So we get:

$$\ldots \le \sum_{i=1}^{n-1} \sum_{j=1}^{\infty} P(|\delta_i| \ge a + j - 1)\,(a + j)^4\, O\left(j^{k-6}\right).$$

Since we know from the condition on the main theorem that the first $k \ge 4$ moments exist, we can
apply Theorem 8 to find that $P(|\delta_i| \ge a + j - 1) = O\left(i^{-\lceil k/2 \rceil}(a + j - 1)^{-k}\right) = O\left(i^{-k/2}\right) O\left(j^{-k}\right)$ (since $k$
has to be even); plugging this into the equation and simplifying we obtain $\sum_i O\left(i^{-k/2}\right) \sum_j O\left(j^{-2}\right)$.
For $k \ge 4$ this expression converges.

Now we consider the case where the outcomes are bounded from above by $g$. This case
is more complicated, since now we have made no extra assumptions as to existence of the
moments of $P$. Of course, if the outcomes are bounded from both sides, then all moments
necessarily exist, but if the outcomes are unbounded from below this may not be true. We
use a trick to remedy this: we map all outcomes into a new domain in such a way that all
moments of the transformed variables are guaranteed to exist. Any constant $x^-$ defines a
mapping $g(x) := \max\{x^-, x\}$. Furthermore we define the random variables $Y_i := g(X_i)$, the
initial outcome $y_0 := g(x_0)$ and the mapped analogues of $\mu^*$ and $\hat\mu_i$, respectively: $\mu^\dagger$ is defined
as the mean of $Y$ under $P$ and $\tilde\mu_i := (y_0 \cdot n_0 + \sum_{j=1}^i Y_j)/(i + n_0)$. Since $\tilde\mu_i \ge \hat\mu_i$, we can bound:

$$\begin{aligned}
\left|\sum_i P(\delta_i \in A_1)\, E\left[\delta_i^4 D^{(4)}(\ddot\mu) \mid \delta_i \in A_1\right]\right|
&\le \sum_i P(\hat\mu_i - \mu^* \ge a) \sup_{\delta_i \in A_1} \left|\delta_i^4 D^{(4)}(\ddot\mu)\right| \\
&\le \sum_i P\left(|\tilde\mu_i - \mu^\dagger| \ge a + \mu^* - \mu^\dagger\right) g^4 \sup_{\delta_i \in A_1} \left|D^{(4)}(\ddot\mu)\right|.
\end{aligned}$$

By choosing $x^-$ small enough, we can bring $\mu^\dagger$ and $\mu^*$ arbitrarily close together; in particular
we can choose $x^-$ such that $a + \mu^* - \mu^\dagger > 0$ so that application of Theorem 8 is safe. It reveals
that the summed probability is $O\left(i^{-k/2}\right)$. Now we bound $D^{(4)}(\ddot\mu)$, which is $O\left((g - \ddot\mu)^{-m}\right)$ for some
$m \in \mathbb{N}$ by the condition on the main theorem. Here we use that $\ddot\mu \le \hat\mu_i$; the latter is maximized
if all outcomes equal the bound $g$, in which case the estimator equals $g - n_0(g - x_0)/(i + n_0) =
g - O(i^{-1})$. Putting all of this together, we get $\sup |D^{(4)}(\ddot\mu)| = O\left((g - \ddot\mu)^{-m}\right) = O(i^m)$; if we
plug this into the equation we obtain:

$$\ldots \le \sum_i O\left(i^{-k/2}\right) g^4\, O\left(i^m\right) = g^4 \sum_i O\left(i^{m - k/2}\right).$$

This converges if we choose $k \ge 6m$. We can do this because the construction of the mapping
$g(\cdot)$ ensures that all moments exist, and therefore certainly the first $6m$. □



9 Proof of Theorem 2

We use the same conventions as in the proof of Theorem 1. Specifically, we concentrate on
the random variables $X_1, X_2, \ldots$ rather than $Z_1, Z_2, \ldots$, which is justified by Equation (14).
Let $f(x^n) = -\ln M_{\mu^*}(x^n) - [\inf_{\mu \in \Theta_\mu} -\ln M_\mu(x^n)]$. Within this section, $\hat\mu(x^n)$ is defined as
the ordinary ML estimator. Note that, if $x^n$ is such that its ML estimate is defined, then
$f(x^n) = -\ln M_{\mu^*}(x^n) + \ln M_{\hat\mu(x^n)}(x^n)$.

Note $d(n) = E_P[f(X^n)]$. Let $h(x)$ be the carrier of the exponential family under consideration
(see Definition ^). Without loss of generality, we assume $h(x) > 0$ for all $x$ in the finite set
$\mathcal{X}$. Let $a_n^2 = n^{-1/2}$. We can write

$$\begin{aligned}
d(n) = E_P[f(X^n)] &= \pi_n E_P\left[f(X^n) \mid (\mu^* - \hat\mu_n)^2 \ge a_n^2\right] \\
&\quad + (1 - \pi_n) E_P\left[f(X^n) \mid (\mu^* - \hat\mu_n)^2 < a_n^2\right], \qquad (21)
\end{aligned}$$

where $\pi_n = P((\mu^* - \hat\mu_n)^2 \ge a_n^2)$. We determine $d(n)$ by bounding the two terms on the right of
(21). We start with the first term. Since $X$ is bounded, all moments of $X$ exist under $P$, so
we can bound $\pi_n$ using Theorem 8 with $k = 8$ and $\delta = a_n = n^{-1/4}$. (Note that the theorem in
turn makes use of Theorem 7, which remains valid when we use $n_0 = 0$.) This gives

$$\pi_n = O\left(n^{-2}\right). \qquad (22)$$



Note that for all $x^n \in \mathcal{X}^n$, we have

$$0 \le f(x^n) \le \sup_{x^n \in \mathcal{X}^n} f(x^n) \le \sup_{x^n \in \mathcal{X}^n} -\ln M_{\mu^*}(x^n) \le nC, \qquad (23)$$

where $C$ is some constant. Here the first inequality follows because $\hat\mu$ maximizes $\ln M_{\hat\mu(x^n)}(x^n)$
over $\mu$; the second is immediate; the third follows because we are dealing with discrete data, so
that $M_{\hat\mu}$ is a probability mass function, and $M_{\hat\mu}(x^n)$ must be $\le 1$. The final inequality follows
because $\mu^*$ is in the interior of the parameter space, so that the natural parameter $\eta(\mu^*)$ is in
the interior of the natural parameter space. Because $X$ is bounded and we assumed $h(x) > 0$
for all $x \in \mathcal{X}$, it follows by the definition of exponential families that $\sup_{x \in \mathcal{X}} -\ln M_{\mu^*}(x) < \infty$.

Together (22) and (23) show that the expression on the first line of (21) converges to 0, so
that (21) reduces to

$$d(n) = (1 - \pi_n) E_P\left[f(X^n) \mid (\mu^* - \hat\mu_n)^2 < a_n^2\right] + O\left(n^{-1}\right). \qquad (24)$$
To evaluate the term inside the expectation further we first Taylor approximate $f(x^n)$ around
$\hat\mu_n = \hat\mu(x^n)$, for given $x^n$ with $(\mu^* - \hat\mu_n)^2 < a_n^2 = 1/\sqrt{n}$. We get

$$f(x^n) = -(\mu^* - \hat\mu_n) \frac{d}{d\mu} \ln M_{\hat\mu_n}(x^n) + n \frac{1}{2} (\mu^* - \hat\mu_n)^2 I(\mu_n), \qquad (25)$$

where $I$ is the Fisher information (as defined in Section 7) and $\mu_n$ lies in between $\mu^*$ and $\hat\mu$, and
depends on the data $x^n$. Since the first derivative of $\ln M_\mu(x^n)$ at the ML estimate $\hat\mu$ is 0, the first-order
term is 0. Therefore $f(x^n) = \frac{1}{2} n (\mu^* - \hat\mu_n)^2 I(\mu_n)$, so that

$$\frac{1}{2} n g(n) \inf_{\mu \in [\mu^* - a_n, \mu^* + a_n]} I(\mu) \le E_P\left[f(X^n) \mid (\mu^* - \hat\mu_n)^2 < a_n^2\right] \le \frac{1}{2} n g(n) \sup_{\mu \in [\mu^* - a_n, \mu^* + a_n]} I(\mu),$$

where we abbreviated $g(n) := E_P[(\mu^* - \hat\mu_n)^2 \mid (\mu^* - \hat\mu_n)^2 < a_n^2]$. Since $I(\mu)$ is smooth and
positive, we can Taylor-approximate it as $I(\mu^*) + O(n^{-1/4})$, so we obtain the bound:

$$E_P\left[f(X^n) \mid (\mu^* - \hat\mu_n)^2 < a_n^2\right] = n g(n) \left(\frac{1}{2} I(\mu^*) + O\left(n^{-1/4}\right)\right). \qquad (26)$$



To evaluate $g(n)$, note that we have

$$E_P\left[(\mu^* - \hat\mu_n)^2\right] = \pi_n E_P\left[(\mu^* - \hat\mu_n)^2 \mid (\mu^* - \hat\mu_n)^2 \ge a_n^2\right] + (1 - \pi_n) g(n). \qquad (27)$$

Using Theorem 5 with $n_0 = 0$ we rewrite the expectation on the left hand side as $\mathrm{var}_P X / n$.
Subsequently reordering terms we obtain:

$$g(n) = \frac{(\mathrm{var}_P X)/n - \pi_n E_P\left[(\mu^* - \hat\mu_n)^2 \mid (\mu^* - \hat\mu_n)^2 \ge a_n^2\right]}{1 - \pi_n}.$$

Plugging this into bound (26), and multiplying both sides by $1 - \pi_n$, we get:

$$(1 - \pi_n) E_P\left[f(X^n) \mid (\mu^* - \hat\mu_n)^2 < a_n^2\right] = \left(\mathrm{var}_P X - n \pi_n E_P\left[(\mu^* - \hat\mu_n)^2 \mid (\mu^* - \hat\mu_n)^2 \ge a_n^2\right]\right) \left(\frac{1}{2} I(\mu^*) + O\left(n^{-1/4}\right)\right).$$

Since $X$ is bounded, the expectation on the right must lie between 0 and some constant $C$.
Using $\pi_n = O(n^{-2})$ and the fact that $I(\mu^*) = 1/\mathrm{var}_{M_{\mu^*}} X$ (Equation Q), we get

$$(1 - \pi_n) E_P\left[f(X^n) \mid (\mu^* - \hat\mu_n)^2 < a_n^2\right] = \frac{1}{2} \frac{\mathrm{var}_P X}{\mathrm{var}_{M_{\mu^*}} X} + O\left(n^{-1/4}\right).$$

The result follows if we combine this with (24). □



10 Conclusion and Future Work 

In this paper we established two theorems about the relative redundancy, defined in Section EJ 

1. A particular type of universal code, the prequential ML code or ML plug-in code, exhibits
behavior that we found unexpected. While other important universal codes, such as the
NML/Shtarkov and Bayesian codes, achieve a regret of $\frac{1}{2}\ln n$, where $n$ is the sample size,
the prequential ML code achieves a relative redundancy of $\frac{1}{2}\frac{\mathrm{var}_P X}{\mathrm{var}_{M_{\mu^*}} X}\ln n$. (Sections Inland El)

2. At least for finite sample spaces, the relative redundancy is very close to the expected
regret, the difference going to $\frac{1}{2}\frac{\mathrm{var}_P X}{\mathrm{var}_{M_{\mu^*}} X}$ as the sample size increases (Section^ Theorem 2).

In future work, we hope to extend this theorem to general 1-parameter exponential families 
with arbitrary sample spaces. 

Under the heading "Related Work" in Section 2 we list a substantial amount of literature in
which the regret for the prequential ML code is proven to grow with $\frac{1}{2}\ln n$. While this may
seem to contradict our results, in fact it does not: in those articles, settings are considered
where $P \in \mathcal{M}$, and under such circumstances our own findings predict precisely that behavior.

The first result is robust with respect to slight variations in the definition of the prequential 
ML code: in our framework the so-called "start-up problem" (the unavailability of an ML esti- 
mate for the first few outcomes) is resolved by introducing fake initial outcomes. Our framework 
thus also covers prequential codes that use other point estimators such as the Bayesian MAP 
and mean estimators defined relative to a large class of reasonable priors. In Section 5.2 we
conjecture that no matter what in-model estimator is used, the prequential model cannot yield
a relative redundancy of $\frac{1}{2}\ln n$ independently of the variance of the data generating distribution.